Abstract

Conventional methods of navigating within a virtual reality environment involve the use of
interfaces such as keyboards, hand-held input devices such as joysticks, mice, and trackballs, and
hand-worn data gloves. While these devices are mostly adequate, they are rather obtrusive and
require some amount of training to use. Researchers have begun investigating interfaces that
can interpret human gestures visually.

In this document, we describe an approach used to navigate virtual reality environments by
tracking the pose (translation and orientation) of the user's face. This "hands-free" navigation is
simple, intuitive, and unobtrusive. It requires only commercially available products such as a
camera and an image digitizer. The pose of the face is determined by warping a reference face
image to minimize the intensity difference between the warped reference face image and the current
face image. This is more robust than detecting only selected facial features because all pixels in
the face are used. In addition, the proposed approach does not require a geometric model of
the face.
1 Introduction

Conventional methods of navigating within a virtual reality environment involve the use of
interfaces such as keyboards, hand-held input devices such as joysticks, mice, and trackballs, and
hand-worn data gloves. While these devices are mostly adequate, they are rather obtrusive and
require some amount of training to use. In addition, because of the constant physical use and
manipulation, they either have limited life or require some degree of maintenance. Researchers have
begun investigating natural interfaces that are intuitively simple and unobtrusive to the user.
By natural, we mean communication by way of human gestures and/or speech.

The proposed approach is designed to address the problem of navigating within a virtual envi- 
ronment without the use of keyboards, hand-held input devices, or data gloves. The approach that 
we take is to track the pose (i.e., translation and orientation) of the face and use that information 
to move and orient the virtual environment accordingly. This method of navigating the virtual 
environment is very intuitive and easy. Part of the novelty of the approach is the tracking of the 
entire face image without the use of any geometric face model. The simplicity of the approach 
allows fast tracking without using specialized image processing boards. The speed of face tracking 
currently achieved is 4 frames per second on a DEC AlphaStation 600 with an image patch size of 
98x80 pixels. 

1.1 Relevant work 

Approaches to controlling interaction in a virtual environment have been mostly limited to using 
hand gestures for games (using a hand worn device such as a Mattel glove) or for manipulating 
virtual objects using a dataglove (e.g., [18]). 

Other relevant work relates to face tracking. In [3], the full face is
tracked using a detailed face model that relies on image intensity values, deformable model
dynamics, and optical flow. This representation can be used to track facial expressions. Due to its
complexity, processing is reported to take 3 seconds per frame on a 200 MHz SGI
machine. Initialization of the face model on the real image involves manually marking face locations,
and takes 2 minutes on the same SGI machine.

In [5], a face model in the form of a 3-D mesh is used. However, the emphasis of this work is
to recognize facial expressions, and it assumes that there is no global facial translation or rotation.

The work reported in [6, 8] is probably the most relevant to our approach. However, both require
detection of specific facial features and ratios of distances between facial features. In [6], the gaze
direction is estimated from the locations of specific features of the face, namely eye corners, tip of
the nose, and corners of the mouth. As described in the paper, these features are manually chosen.
In [8], 3-D head orientation is estimated by tracking five points on the face (four at the eye corners
and one at the tip of the nose). Again the facial features are selected by hand.

Also relevant is [14], which describes a real-time (20 frames per second) facial feature tracking 
system based on template matching. The system includes the DataCube real-time image processing 
equipment. The face and mouth areas are extracted using color histogramming while the eyes 
are tracked using sequential template matching. An application cited is the visual mouse, which 
emulates the functionality of the physical mouse through eye position (cursor movement) and 
mouth shape change (clicking operation). Again this system tracks specific features of the face; 
it is not clear if this form of tracking (sequential) is stable over time and whether reliable face 
orientation can be derived from so few features. 

In their work on facial image coding, Li et al. [10] use a 3-D planar polygonized face model 
and assume 3-D affine motion of points. They track the motion of the face model (both local 
and global) using optic flow to estimate the facial action units (based on the facial action coding 
system, or FACS [4]). A feedback loop scheme is employed to minimize the error between the 
synthetically generated face image based on motion estimates and the true face image. However, 
they have to estimate the depth of the face, assumed segmented out, in the scene. The feature node 
points of the face model are manually adjusted to initially fit the face in the image. No timing 
results were given in their paper. 

Azarbayejani et al.'s [1] system tracks manually picked points on the head and, based on
recursive structure-from-motion estimates and extended Kalman filtering, determines the 3-D pose of
the head. The cited translation and rotation errors, with respect to the Polhemus tracker estimates,
are 1.67 cm and 2.4° respectively. The frame rate achieved is 10 frames per second. Their system
requires local point feature trackers.

The approach described in [19] is that of block-based template matching. The idea is to take
many image samples of faces (152 images of 22 people), partition the images into
blocks (each of which is 5x7 pixels), and compute statistics of the intensity and strength of edges
within each block. The results are then used as a template to determine the existence of a face
in an image as well as its orientation. No timing results are given. In comparison, the initial
steps of sampling faces and performing statistical analysis of the samples are not required in our
approach. In addition, for the work described in [19], the orientation of the face is determined by
interpolating between known sampled face orientations. Our approach measures the face
orientation directly, without any interpolation scheme.
1.2 The "hands-off" navigation approach

Our concept of "hands-off" navigation is depicted in Figure 1. The camera is mounted in front and 
above the user; it views the user's face from a tilted angle. The system requires a reference face 
image, and this image is captured for initialization. The reference face image is that of the user in a 
neutral pose, where he/she is facing directly ahead below the camera. To determine the head trans- 
lation and orientation, the face tracker warps the reference face image to minimize the difference 
between the warped reference face image and the current face image. This is equivalent to using 
the reference face image as a globally deformable template. The warping matrix transformation is 
then decomposed to yield the face translation and orientation. Subsequently, the view point of the 
3-D virtual environment changes accordingly. 

The software is written in C and runs on a DEC AlphaStation 600. The 3-D virtual
environment viewer used for our system is VRweb, which was originally developed by the Institute for
Information Processing and Computer Supported New Media (IICM), Graz University of
Technology, in Austria (see footnote 1). We customized it to allow TCP/IP communication with the face tracker.

In navigating a virtual environment, it is very likely that the user would not want to rotate the
scene about the viewing direction. Hence we adopt this convenient assumption, and disable control
of rotational motion about the viewing axis (i.e., rotation about the t_z vector in Figure 1).

This research is done in connection with the Smart Kiosk project at Cambridge Research Lab, 
Digital Equipment Corp. [20]. The Smart Kiosk can be considered as an enhanced version of 
the Automatic Teller Machine, with the added capability of being able to interact with the user 
through body tracking, and gesture and speech recognition. The "hands-off" capability would 
enable the Smart Kiosk to allow the user to navigate the local virtual environment as an information 
dispensing appliance. The local virtual environment could be created using either direct range data 
or stereo from multiple real images [9]. 

1.3 Organization of document 

We first review the most general global motion tracking, namely full 2-D perspective tracking, in
Section 2. Here we also describe how the 2-D motion matrix is decomposed directly into various
motion parameters such as translation, magnification, and skew. Since the face is assumed to
be relatively far away from the camera, we can assume an affine model rather than a full 2-D
perspective model. Section 4 describes how the motion parameters derived from Section 2 can
be used to extract head translation and orientation. Results are shown in Section 6, followed by
a discussion of the implications of the method in Section 7. Finally, we summarize the approach in
Section 8.

Footnote 1: The web site for the VRweb browser is http://www.iicm.edu/vrweb.

Figure 1: Concept of hands-free navigation.



2 2-D perspective tracking 

The approach in tracking is the same as the approach for image registration described in [13, 16],
i.e., to directly minimize the discrepancy in intensities between pairs of images after applying the
transformation we are recovering. This has the advantage of not requiring any easily identifiable
feature points, and of being statistically optimal once we are in the vicinity of the true solution
[17]. The technique minimizes the sum of the squared intensity errors

    E = \sum_i [I'(x'_i, y'_i) - I(x_i, y_i)]^2 = \sum_i e_i^2        (1)

subject to the constraint

    x'_i = \frac{m_{00} x_i + m_{01} y_i + m_{02}}{m_{20} x_i + m_{21} y_i + 1},  \quad
    y'_i = \frac{m_{10} x_i + m_{11} y_i + m_{12}}{m_{20} x_i + m_{21} y_i + 1}        (2)

The objective function E is applied over the region of interest.
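As an illustration of (1) and (2), the following is a minimal Python sketch, not the C implementation described in this document. The function names, the nested-list image representation, the use of bilinear interpolation, and the choice to skip pixels that map outside the second image are all assumptions made for the example.

```python
import math

def warp_point(m, x, y):
    """Map (x, y) through the 2-D perspective matrix m (3x3, m[2][2] == 1), Eq. (2)."""
    d = m[2][0] * x + m[2][1] * y + 1.0
    xp = (m[0][0] * x + m[0][1] * y + m[0][2]) / d
    yp = (m[1][0] * x + m[1][1] * y + m[1][2]) / d
    return xp, yp

def bilinear(img, x, y):
    """Sample img (a list of rows) at real-valued (x, y); None if outside the image."""
    h, w = len(img), len(img[0])
    if x < 0 or y < 0 or x > w - 1 or y > h - 1:
        return None
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * img[y0][x0] + fx * img[y0][x1]
    bottom = (1 - fx) * img[y1][x0] + fx * img[y1][x1]
    return (1 - fy) * top + fy * bottom

def ssd_error(image, image_prime, m):
    """E of Eq. (1): sum of squared differences I'(x', y') - I(x, y).

    Pixels whose warped position falls outside image_prime are skipped
    (an assumption; the text only says E is applied over a region of interest).
    """
    total = 0.0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            xp, yp = warp_point(m, x, y)
            sample = bilinear(image_prime, xp, yp)
            if sample is not None:
                e = sample - value
                total += e * e
    return total
```

With the identity matrix and two identical images the error is zero; any other matrix is scored by how well it aligns the two intensity patterns.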

The minimization of E is done using the Levenberg-Marquardt iterative non-linear
minimization algorithm [12]. This algorithm requires the computation of the partial derivatives of e_i with
respect to the unknown motion parameters {m_{00} ... m_{21}}. These are straightforward to compute,
i.e.,

    \frac{\partial e_i}{\partial m_{00}} = \frac{x_i}{D_i} \frac{\partial I'}{\partial x'},  \quad \ldots,  \quad
    \frac{\partial e_i}{\partial m_{21}} = -\frac{y_i}{D_i} \left( x'_i \frac{\partial I'}{\partial x'} + y'_i \frac{\partial I'}{\partial y'} \right)        (3)

where D_i is the denominator in (2), and (\partial I'/\partial x', \partial I'/\partial y') is the image intensity gradient of I'
at (x'_i, y'_i). From these partials, the Levenberg-Marquardt algorithm computes an approximate
Hessian matrix A and the weighted gradient vector b with components

    a_{kl} = \sum_i \frac{\partial e_i}{\partial m_k} \frac{\partial e_i}{\partial m_l},  \quad
    b_k = -\sum_i e_i \frac{\partial e_i}{\partial m_k}        (4)

and then updates the motion parameter estimate m by an amount \Delta m = (A + \mu I)^{-1} b, where \mu
is a time-varying stabilization parameter [12]. The advantage of using Levenberg-Marquardt over
straightforward gradient descent is that it converges in fewer iterations.
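The update step can be sketched in a few lines of Python. This is an illustrative sketch only: the names `solve` and `lm_step` are hypothetical, the Jacobian is assumed to be supplied as one row of partial derivatives per pixel, and b is taken as the negative gradient term -\sum_i e_i \partial e_i/\partial m_k (an assumption about the constant factor).

```python
def solve(a, b):
    """Solve the n x n system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / aug[r][r]
    return x

def lm_step(jacobian, residuals, mu):
    """Return delta_m = (A + mu I)^-1 b for one Levenberg-Marquardt iteration.

    jacobian[i][k] holds d e_i / d m_k and residuals[i] holds e_i.
    """
    n = len(jacobian[0])
    a = [[sum(row[k] * row[l] for row in jacobian) for l in range(n)]
         for k in range(n)]
    b = [-sum(e * row[k] for row, e in zip(jacobian, residuals))
         for k in range(n)]
    for k in range(n):
        a[k][k] += mu  # stabilization: larger mu gives a shorter, safer step
    return solve(a, b)
```

With mu = 0 the step reduces to a Gauss-Newton step; increasing mu shortens the step and turns it toward the gradient descent direction, which is the usual reason for varying mu over time.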

To enable the tracker to be more tolerant of larger displacements, we employ a hierarchical
scheme where coarse estimates are first found at coarse resolutions and then refined at higher
resolutions. In our implementation, we can specify an arbitrary number of resolution levels and iterations
at each level. We set the number of resolutions to be 3 and the number of iterations per level to be
3.
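The hierarchical scheme can be sketched as follows. This is a hedged illustration rather than the implementation used here: the 2x2 block averaging, the 2x3 affine parameter layout, the doubling of the translation terms between levels, and the `refine` callback (standing in for one registration iteration) are all assumptions.

```python
def downsample(img):
    """Halve the resolution by averaging non-overlapping 2x2 blocks."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1] +
              img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)] for y in range(h)]

def build_pyramid(img, levels=3):
    """Return [finest, ..., coarsest] with the requested number of levels."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid

def coarse_to_fine(ref, cur, refine, levels=3, iterations=3):
    """Run refine(ref_level, cur_level, m) from the coarsest to the finest level.

    m is a 2x3 affine matrix; only its translation column depends on the
    pixel scale, so it is doubled when moving to the next finer level.
    """
    ref_pyr = build_pyramid(ref, levels)
    cur_pyr = build_pyramid(cur, levels)
    m = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    for level in range(levels - 1, -1, -1):
        for _ in range(iterations):
            m = refine(ref_pyr[level], cur_pyr[level], m)
        if level > 0:
            m[0][2] *= 2.0
            m[1][2] *= 2.0
    return m
```

A displacement of several pixels at full resolution shrinks to about one pixel at the coarsest of three levels, which is why the coarse estimate is easier to find.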



2.1 Decomposition of full 2-D perspective matrix 

Given the full 2-D perspective matrix, we can decompose it into the following warping parameters:

• center point or displacement t_x and t_y (in x and y directions respectively)

• rotation angle \theta_I (about the viewing axis)

• zoom factor \zeta

• aspect ratio a

• skew factor s

• pinch parameters \xi_x and \xi_y (in x and y directions respectively)

Define

    r_x = \zeta a  and  r_y = \zeta / a,

and let s_\theta = \sin \theta_I and c_\theta = \cos \theta_I.

The 2-D perspective matrix (which is first scaled such that m_{22} = 1) can be decomposed as

(6)
The recovery of the full 2-D perspective matrix from two images is generally relatively unstable
if these images are small or have little intensity variation. Our experiments have shown this
to be true, even with face images as large as about 100 x 100 pixels. As a result, we use instead an
approximation of the 2-D perspective model, namely the 2-D affine model.
For the 2-D affine case, we set m_{20} = m_{21} = 0 and \xi_x = \xi_y = 0, giving

(7)



Next, we describe the affine camera, and show the cases when 2-D affine transformation is 
valid. 



3 Affine camera 

The 2-D affine transformation of the image is applicable for the affine camera. The affine camera 
has the projection matrix of the form 
(8)



If p is the 3-D point in space and u is the corresponding affine projection, then 

u = Mp + m (9) 

In the affine camera model, all epipolar lines are parallel, and the epipoles are located at infinity 
in the image planes. This camera model is a generalization of the scaled orthographic (also known 
as weak perspective or paraperspective) camera model, which can be used as a good approximation 
if the change in relative object depth is small compared to its distance to the camera. For a fuller 
description of the affine and scaled orthographic camera models, the reader is encouraged to refer 
to the Appendix of [11].

As shown in [16], the full 2-D perspective image transformation can only be used in cases of planar
surfaces using the perspective camera model and rotation about the camera optic center. The 2-D
affine image transformation can be used only in the cases of planar surfaces and translation under
the affine camera model. To see this, we use a derivation similar to that in [15].

Let the 3-D point p be a point on a planar patch whose unit normal is n. Let also n_{\perp,1} and n_{\perp,2} be
the other two unit vectors that, with n, form an orthonormal basis of \Re^3. Thus p can be specified
as

    p = \alpha n_{\perp,1} + \beta n_{\perp,2} + \lambda n        (10)

Note that p lies on the plane whose equation is p \cdot n = \lambda, \lambda being a constant.
Let also R be the 3 x 3 rotation matrix such that

    R n = (0, 0, 1)^T,  \quad  R n_{\perp,1} = (g_1^T, 0)^T,  \quad  R n_{\perp,2} = (g_2^T, 0)^T        (11)

From (9),

    u = M R^{-1} R p + m = M_R (\alpha R n_{\perp,1} + \beta R n_{\perp,2} + \lambda R n) + m        (12)

where M_R = M R^{-1}. We now partition M_R as (B | b), after which we can rewrite (12) as

    u = B(\alpha g_1 + \beta g_2) + \lambda b + m
      = B(\alpha g_1 + \beta g_2) + b_\lambda        (13)

with b_\lambda = \lambda b + m. Note that the only variables on the right hand side of (13) that depend on the 3-D
point location on the plane are \alpha and \beta.

Similarly, for another affine camera, we have

    u' = B'(\alpha g_1 + \beta g_2) + \lambda b' + m'
       = B'(\alpha g_1 + \beta g_2) + b'_\lambda        (14)

Eliminating (\alpha g_1 + \beta g_2) from (13) and (14) yields

    u' = T u + e        (15)

where T = B' B^{-1} and e = b'_\lambda - T b_\lambda. Hence u' is an affine transformation of u for points on a
plane.
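The elimination step leading to (15) can be written out explicitly. The following short derivation assumes that the 2 x 2 matrix B is invertible, which is needed for T to exist; this condition is an added assumption and fails, for example, when the plane is viewed edge-on.

```latex
\begin{align*}
\alpha g_1 + \beta g_2 &= B^{-1}(u - b_\lambda)
    && \text{from (13)} \\
u' &= B' B^{-1}(u - b_\lambda) + b'_\lambda
    && \text{substituting into (14)} \\
   &= \underbrace{B' B^{-1}}_{T}\, u
      + \underbrace{b'_\lambda - B' B^{-1} b_\lambda}_{e}
    && \text{which is (15)}
\end{align*}
```

Neither T nor e depends on \alpha or \beta, so the same affine map applies to every point on the plane.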

Showing the applicability of the 2-D affine transformation under translation is trivial. If the
3-D translation is \Delta p, then

    u' = M(p + \Delta p) + m = u + M \Delta p        (16)

In this case, the image transform is a translation as well.



Figure 2: Effect of tilt (\tau) on perceived rotation: (a) \tau = \pi/2 (top view), and (b) \tau between 0 and
\pi/2 (view from an angle above).

4 Using affine tracking to determine limited head pose 

We have shown in the preceding section that the 2-D affine transformation is valid for planar surfaces
with an affine camera model. This is a good approximation if the face is far enough away from the
camera, which renders the relative depth change in the face insignificant
relative to the distance of the face from the camera. The face can then be approximated as a planar
surface.

We capitalize on the decomposition of the 2-D affine matrix to determine head pose (location 
and orientation). However, in navigating a virtual environment, it is very likely that the user 
would not want to rotate the scene about the viewing direction. Hence we adopt this convenient 
assumption, and disable control of rotational motion about the viewing axis. 

To keep the camera relatively unobtrusive to the user, it is better to position it higher up above
the monitor and allow it to track the user's head from that location at a tilted angle. This location
has a convenient side-effect: head rotations to either side result in rotations about the viewing
axis, which can be easily obtained from the affine matrix decomposition.

To see this, we consider viewing the head from the top (Figure 2(a)) and viewing the head at
a tilted angle (Figure 2(b)). We assume that perspective effects are negligible. The point p has
rotated by an angle \theta to q. Seen from an angle, the corresponding points are p' and q', and the
perceived rotation angle is \theta(\alpha, \tau), \alpha being the original angle subtended by p with respect to the
x-axis and \tau the tilt angle about the x-axis.

Without loss of generality, we can assume that both p and q are unit distance from the origin.
Hence we get

    p' = (\cos\alpha, \sin\alpha \sin\tau)^T,  \quad
    q' = (\cos(\alpha + \theta), \sin(\alpha + \theta) \sin\tau)^T        (17)

From (17), we can easily recover \theta(\alpha, \tau):

    \theta(\alpha, \tau) = \cos^{-1} \left( \frac{\cos\alpha \cos(\alpha + \theta) + \sin\alpha \sin(\alpha + \theta) \sin^2\tau}
        {\sqrt{(\cos^2(\alpha + \theta) + \sin^2(\alpha + \theta) \sin^2\tau)(\cos^2\alpha + \sin^2\alpha \sin^2\tau)}} \right)        (18)

For the case where the starting head pose is with the head facing horizontally below the camera
(which we also assume), i.e., \alpha = \pi/2, from (18) we get

    \theta(\pi/2, \tau) = \theta_I = \cos^{-1} \left( \frac{\cos\theta \sin\tau}{\sqrt{\sin^2\theta + \cos^2\theta \sin^2\tau}} \right)        (19)
To track the location of the head, we simply track the center of the affine patch (given by t_x
and t_y). Motion in the forward/backward direction is given by the amount of zoom \zeta. Due to the
camera tilt, moving the head ahead has the undesirable effect of giving an image displacement in
the y direction as well. The fix is to disable all other motion while zooming is detected.

If we know the tilt \tau (from (22)), the true head rotation is then

    \theta = \tan^{-1}(\tan\theta_I \sin\tau)        (20)
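The relation between the true and the perceived head rotation can be checked numerically. The Python sketch below is illustrative only; the function names are hypothetical, and the perceived angle is computed as the angle between the foreshortened points (cos a, sin a sin tau), which is the geometry assumed in (18).

```python
import math

def perceived_rotation(theta, tau, alpha=math.pi / 2):
    """Eq. (18): in-image rotation for a true rotation theta seen at tilt tau.

    Angles are in radians. The result is an unsigned magnitude because it
    comes from an arc cosine.
    """
    px, py = math.cos(alpha), math.sin(alpha) * math.sin(tau)
    qx, qy = math.cos(alpha + theta), math.sin(alpha + theta) * math.sin(tau)
    cos_angle = (px * qx + py * qy) / (math.hypot(px, py) * math.hypot(qx, qy))
    return math.acos(max(-1.0, min(1.0, cos_angle)))

def true_rotation(theta_image, tau):
    """Eq. (20): recover the true head rotation from the image rotation."""
    return math.atan(math.tan(theta_image) * math.sin(tau))

tau = math.radians(60.0)
theta_image = perceived_rotation(0.3, tau)
recovered = true_rotation(theta_image, tau)  # close to the original 0.3 rad
```

At tau = pi/2 (top view) the perceived and true rotations coincide; as the tilt decreases, the same head rotation appears larger in the image, which (20) compensates for.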

Finally, the head tilt is determined from the amount of y magnification r_y; because the camera
is situated at a vertical angle with respect to the head, tilting up to face the camera results in larger
r_y (y extent is larger than usual, hence greater than 1), while tilting the head down has the opposite
effect. In the absence of all other motion parameters, the apparent facial height is (see Figure 3)

    \Delta y = r_y \Delta y_0 = r_y \Delta y_{true} \cos\tau_0        (21)

Hence the face tilt angle with respect to the camera is given by

    \tau = \cos^{-1} \left( \frac{\Delta y}{\Delta y_{true}} \right) = \cos^{-1}(r_y \cos\tau_0)        (22)

To determine \tau_0, we can apply a "calibration" technique in which the user tilts his/her head up and
down once. The system keeps track of the maximum value of r_y, say r_{y,max}. Then

    \tau_0 = \cos^{-1} \left( \frac{1}{r_{y,max}} \right)        (23)
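The tilt computation can be sketched as below. This is a hedged Python illustration with hypothetical function names. It assumes that the largest magnification seen during calibration corresponds to the face turned fully toward the camera, so that r_{y,max} cos tau_0 = 1, and it clamps the arc cosine argument to guard against noisy measurements.

```python
import math

def initial_tilt(r_y_max):
    """Starting tilt tau_0 from the largest y magnification seen in calibration."""
    return math.acos(1.0 / r_y_max)

def tilt_from_magnification(r_y, tau_0):
    """Eq. (22): face tilt relative to the camera, tau = acos(r_y cos tau_0)."""
    c = max(-1.0, min(1.0, r_y * math.cos(tau_0)))
    return math.acos(c)

def displaced_tilt(r_y, tau_0):
    """Eq. (24): tilt relative to the starting pose, tau' = tau - tau_0."""
    return tilt_from_magnification(r_y, tau_0) - tau_0

# Example: a calibration sweep that peaked at r_y = 1.1547 gives tau_0 near 30 degrees.
tau_0 = initial_tilt(1.1547)
```

With r_y = 1 (the reference pose) the displaced tilt is zero; a magnification above 1 gives a smaller tau, meaning the face has turned toward the camera.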
Figure 3: Starting tilt \tau_0 of face (represented as planar patch) relative to the camera viewing
direction. \Delta y_{true} is the true facial height while \Delta y_0 is the initial apparent facial height. Again we
assume insignificant perspective effects.

We are interested in the face tilt angle with respect to the environment rather than with respect to
the camera. Hence, the actual tilt angle used to control the orientation of the virtual environment
is the displaced tilt angle given by

    \tau' = \tau - \tau_0 = \cos^{-1}(r_y \cos\tau_0) - \tau_0        (24)



5 Controlling the view

Even though we are able to extract the 5 pose parameters of the face, we still face the problem
of using them to control the viewing of the virtual reality environment. One simple way would
be to directly use the pose parameters to determine the absolute position and orientation of the
viewpoint. However, this limits the viewpoint selection to the poses that the face can assume within
the camera viewing space.

The alternative is to control the viewpoint incrementally, i.e., by changing the viewpoint of the
virtual reality environment in direct response to the change in the pose of the face relative to the
previous pose. To indicate continuous movement within the virtual reality environment beyond
what the absolute pose of the face is able to convey, we added the ability for the face tracker to
detect and respond to a lingering deviation from the reference pose. For example, if the user
is interested in rotating to the left continuously, he/she would rotate his/her head to the left and
maintain that posture. The system would respond by first turning the viewpoint of the virtual
scene. Then, once it has detected the same deviated face posture for longer than a preset time
threshold (2 seconds in our case), it continues to rotate the viewpoint of the virtual scene in the
same manner until the head posture changes.
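One possible way to organize the incremental control with a lingering-deviation rule is sketched below in Python for a single pose parameter. Only the 2-second threshold comes from the text; the class name, the dead zone around the reference pose, and the fixed repeat step are hypothetical choices made for illustration.

```python
import math

class DwellController:
    """Incremental view control for one pose parameter (e.g., pan)."""

    def __init__(self, dwell_seconds=2.0, dead_zone=0.05, repeat_step=0.1):
        self.dwell_seconds = dwell_seconds  # threshold quoted in the text
        self.dead_zone = dead_zone          # assumed tolerance around the reference pose
        self.repeat_step = repeat_step      # assumed increment while lingering
        self.previous = 0.0
        self.deviated_since = None

    def update(self, deviation, timestamp):
        """Return the view increment for the current deviation from the reference pose."""
        increment = deviation - self.previous  # response to the change in pose
        self.previous = deviation
        if abs(deviation) <= self.dead_zone:
            self.deviated_since = None         # back at the reference pose
            return increment
        if self.deviated_since is None:
            self.deviated_since = timestamp
        if timestamp - self.deviated_since > self.dwell_seconds:
            # Posture held past the threshold: keep moving in the same direction.
            increment += math.copysign(self.repeat_step, deviation)
        return increment
```

Calling `update(0.3, t)` repeatedly with a held head turn returns 0.3 once, then 0 until the threshold has passed, and then the repeat step on every later call until the head returns to the reference pose.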
Figure 4: Effect of moving along the camera axis. f is the camera focal length, L is the length
of the object, z_0 is the reference location, \delta z is the change in object location, and h_0 and h are the
projected image lengths (in pixels) of the object at positions z_0 and (z_0 + \delta z) respectively.



To minimize the possibility of sudden jumps in consecutive viewpoints, we employ a simple 
Kalman-based filter to smooth the motion trajectory (see, for example, [7]). 
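The text does not specify the filter model, so the following Python sketch shows only one simple possibility: a scalar Kalman filter with a random-walk state, applied independently to each pose parameter. The function name and the two variance values are assumptions chosen for illustration.

```python
def kalman_smooth(measurements, process_var=1e-2, measurement_var=1.0):
    """Smooth a sequence of scalar measurements with a random-walk Kalman filter."""
    if not measurements:
        return []
    estimate, variance = measurements[0], measurement_var
    smoothed = [estimate]
    for z in measurements[1:]:
        variance += process_var                         # predict: state may drift
        gain = variance / (variance + measurement_var)  # weight given to the new sample
        estimate += gain * (z - estimate)               # update with the innovation
        variance *= (1.0 - gain)
        smoothed.append(estimate)
    return smoothed
```

A larger `process_var` makes the output follow the head more closely at the cost of more jitter; a smaller value gives a smoother but more sluggish trajectory.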

While the orientation angles of tilt and pan can be directly used, we still need the translational
scaling factors in x, y, and z. These are dependent on the relative scaling of the virtual environment.
However, converting the amount of zoom \zeta (see Section 2.1) to a change in depth z is less direct. From
Figure 4, let z_0 be the reference depth location of the face. If the face has moved by \delta z, then by
similarity of triangles, we have

    \frac{L}{z_0} = \frac{h_0}{f}  and  \frac{L}{z_0 + \delta z} = \frac{h}{f},        (25)

with h = \zeta h_0. Thus,

    \frac{z_0}{z_0 + \delta z} = \zeta,        (26)

from which

    \delta z = z_0 \left( \frac{1}{\zeta} - 1 \right)        (27)
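Equation (27) translates directly into code. In this small Python sketch the function name and the default reference depth of 1.0 (giving a depth change in units of z_0) are assumptions.

```python
def depth_change(zoom, z0=1.0):
    """Eq. (27): change in depth for a zoom factor, delta_z = z0 * (1/zoom - 1)."""
    if zoom <= 0:
        raise ValueError("zoom factor must be positive")
    return z0 * (1.0 / zoom - 1.0)

# A face that appears twice as large has halved its distance to the camera.
closer = depth_change(2.0, z0=100.0)   # -50.0
```

A zoom factor above 1 gives a negative depth change (the face moved toward the camera) and a factor below 1 gives a positive one.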




6 Results 



The speed of face (affine) tracking currently achieved is 4 frames per second on a DEC AlphaSta- 
tion 600 with an image patch size of 98x80 pixels. The number of hierarchical levels is set to 3, 
and the number of iterations in image registration per level is also set to 3. 
Figure 5: Face tracking results: (a) Reference face; (b) Face turned to the left of camera; (c) Face
turned to the right of camera; (d) Face moved closer to the camera; (e) Face turned upwards; (f)
Face turned downwards. The monochrome subimage of the face is the computed edge strength
image.
Figure 6: Response by viewer to changes in face pose (see Figure 5): (a) Initial view; (b) View
rotated to the left; (c) View rotated to the right; (d) View moved closer; (e) View rotated upwards;
(f) View rotated downwards.
Figure 7: Pose parameters against time for head motion in predominantly X and then Y image plane
directions. (a) and (b) are the unfiltered and filtered versions respectively.

Examples of plots of X, Y, Z, Pan, and Tilt against the frame number (both before filtering and
after filtering) are shown in Figures 7-9. X and Y are given in pixels, Z is the scaled change in
depth as given in (27), and Pan and Tilt are in degrees. As can be seen, the Kalman filter has the
effect of smoothing the pose parameters.



7 Discussion 



The face is obviously not planar, and the most robust approach to tracking the face would probably
be to employ a properly initialized full 3-D face model on the image with a perspective camera
model, as used by [3]. However, the face model initialization is not trivial, nor is it fast. The
method described in this article sacrifices some accuracy for ease of initialization and speed of
tracking by making the basic assumption that the face is far enough from the camera. In our
application of navigating a virtual environment, absolute accuracy is not necessary. This distance
assumption allows the scaled orthographic camera model to be used, and the face to be assumed
relatively planar.

Given the assumptions of significant distance from the camera and object planarity, there is
also the issue of the camera model. Using the full 2-D perspective model to extract global image motion
is more appropriate but more unstable, while using a pure 2-D rigid transform lacks the important
parameters of scaling and skewing. As shown by our results, the 2-D affine model works well and
is a good compromise.

Figure 9: Pose parameters against time for head motion that includes moving towards and away
from the camera. (a) and (b) are the unfiltered and filtered versions respectively.

Depending on the speed of head motion, the tracker pose output trajectory can be a little jerky. 
This is due to image noise and limited speed of convergence in image registration. To reduce the 
severity of this problem, we use the Kalman filter; it does help to produce a smoother trajectory of 
navigation. 

8 Summary 

In summary, the key features of our approach to "hands-off" navigation in virtual reality environ- 
ments are: 

• Only one camera is used, and its calibration is not necessary. 

• Initialization is done by just taking a snapshot of the face in a neutral pose (facing directly 
ahead below the camera). This reference face snapshot is used to track the pose of the face. 

• Tracking and pose determination is done using all the pixels in the image patch. This is
considerably more robust than tracking specific small local features as is usually done [6, 8,
14]. To reduce the dependency of the approach on illumination, we use the edge strength image
rather than the direct intensity image. A comparative study of face recognition approaches
seems to favor using templates as compared to using specific geometrical face features [2].

• At any given instant of time, the reference face image that has been taken during the
initialization step is warped so as to minimize its intensity difference with the current face image.
The warping transformation matrix is then decomposed to yield both position and orientation
of the face. This is equivalent to deformable template matching with global motion.

• A geometric face model is not required. This is a major departure from most existing meth- 
ods of tracking the face [3, 5, 6, 8, 14, 19]. As such, there are no complicated initialization 
steps required of this approach. 

• Facial pose tracking is simplified by the observation that most people would usually not
navigate a virtual environment by rotating about the viewing axis. The viewing camera is
mounted in front of and above the user's head at a tilted angle so as to be unobtrusive.

• The face, rather than the articulated hand, is being tracked; tracking the articulated hand and
recognizing hand gestures is considerably more complicated and possibly error-prone.

We have also shown some results of virtual reality environment navigation by tracking the 
head. 